Abstract

We present two alternative ways to apply PAC-Bayesian analysis to sequences of dependent
random variables. The first is based on a new lemma that makes it possible to bound expectations
of convex functions of certain dependent random variables by expectations of the same
functions of independent Bernoulli random variables. This lemma provides an alternative
tool to the Hoeffding-Azuma inequality for bounding concentration of martingale values. Our
second approach is based on integration of the Hoeffding-Azuma inequality with PAC-Bayesian
analysis. We also introduce a way to apply PAC-Bayesian analysis in situations of limited
feedback. We combine the new tools to derive PAC-Bayesian generalization and regret
bounds for the multiarmed bandit problem. Although our regret bound is not yet as tight as
state-of-the-art regret bounds based on other well-established techniques, our results
significantly expand the range of potential applications of PAC-Bayesian analysis and introduce
a new analysis tool to reinforcement learning and many other fields where martingales and
limited feedback are encountered.

1 Introduction

PAC-Bayesian analysis was introduced over a decade ago (Shawe-Taylor and Williamson, 1997;
Shawe-Taylor et al., 1998; McAllester, 1998; Seeger, 2002) and has since made a significant
contribution to the analysis and development of supervised learning methods. The power of the
PAC-Bayesian approach lies in the successful marriage of the flexibility and intuitiveness of Bayesian
models with the rigor of PAC analysis. PAC-Bayesian bounds provide an explicit and often intuitive and
easy-to-optimize trade-off between model complexity and empirical data fit, where the complexity
can be nailed down to the resolution of individual hypotheses via the prior definition.
PAC-Bayesian analysis was applied to derive generalization bounds and new algorithms for linear
classifiers and maximum margin methods (Langford and Shawe-Taylor, 2002; McAllester, 2003;
Germain et al., 2009), structured prediction (McAllester, 2003), and clustering-based classification
models (Seldin and Tishby, 2010), to name just a few. However, the application of PAC-Bayesian
analysis beyond the supervised learning domain remained surprisingly limited. In fact, the only
additional domain known to us is density estimation (Seldin and Tishby, 2010; Higgs and Shawe-Taylor,
2010).

Even within supervised learning the applications of PAC-Bayesian analysis were restricted to
i.i.d. data for a long time. The issue of treating non-independent samples was partially addressed
only recently by Ralaivola et al. (2010) and Lever et al. (2010) (their approaches are also suitable
for density estimation; Higgs and Shawe-Taylor, 2010). The solution of Ralaivola et al. (2010)
essentially boils down to breaking the sample into independent (or almost independent) subsets (which
also reduces the effective sample size to the number of independent subsets). Such an approach is
inapplicable to martingales due to the strong dependence of the cumulative sum on all of its
components. Lever et al. (2010) employed Hoeffding's canonical decomposition of U-statistics into forward
martingales and applied PAC-Bayesian analysis directly to these martingales. Our second approach
to handling sequences of dependent samples, which combines PAC-Bayesian analysis with the
Hoeffding-Azuma inequality, is based on similar ideas. Our first approach to sequences of dependent samples
is based on the new lemma that allows us to bound expectations of functions of certain sequentially
dependent random variables by expectations of the same functions of independent random variables.

One of the most prominent and important fields of application of martingales is reinforcement
learning. Some potential advantages of applying PAC-Bayesian analysis in reinforcement
learning were recently pointed out by several researchers, including Tishby and Polani (2010) and
Fard and Pineau (2010). Tishby and Polani (2010) suggested that the mutual information between
states and actions in a policy can be used as a natural regularizer in reinforcement learning. They
showed that regularization by mutual information can be incorporated into Bellman equations and
therefore can be computed efficiently. Tishby and Polani conjectured that PAC-Bayesian analysis
can be applied to justify such a form of regularization and provide generalization guarantees for it.

Fard and Pineau (2010) suggested a PAC-Bayesian analysis of batch reinforcement learning.
They used the analysis to design an algorithm that is able to leverage the prior knowledge when it
is informative and conforms to the data distribution, and to ignore it when it is irrelevant. In the first
case Bayesian learning algorithms perform well and in the second case PAC learning algorithms
perform better, whereas Fard and Pineau showed that their algorithm performs on par with the
better of the two in all situations. However, the analysis of Fard and Pineau does not address the
exploration-exploitation trade-off, which is the key feature of reinforcement learning. In their batch
analysis they assume that every action was sampled in every state some minimal number of times,
and the bound decreases at the rate of the square root of the minimum over states and actions of the
number of times an action was sampled in a state. Clearly, such an analysis is not applicable in the
online setting: we do not want to sample "bad" actions many times, but then the bound does
not improve with time.

One of the reasons for the difficulty of applying PAC-Bayesian analysis to address the
exploration-exploitation trade-off is the limited feedback (the fact that we only observe the reward for the action
taken, but not for all the rest). In supervised learning (and also in density estimation) the empirical
error for each hypothesis within a hypothesis class can be evaluated on all the samples and therefore
the size of the sample available for evaluation of all the hypotheses is the same (and usually relatively
large). In the situation of limited feedback the sample from one action cannot be used to evaluate
another action (that is the reason why the bound of Fard and Pineau (2010) depends on the minimum
of the number of times any action was taken in any state, which is the minimal sample size available
for evaluation of all state-action pairs). In the online setting the sample size of "bad" actions has to
increase sublinearly in the number of game rounds, which results in slow or even no convergence of
the bound. We resolve this issue by applying a weighted sampling strategy (Sutton and Barto, 1998),
which is commonly used in the analysis of non-stochastic bandits (Auer et al., 2002b), but has not
been applied to the analysis of stochastic bandits previously.

The usage of weighted sampling introduces two new difficulties. One is the dependence between 
the samples: the rewards we observe influence the distribution over actions we play and through 
this distribution influence the variance of the subsequent weighted sample variables. We handle 
this dependence using our new PAC-Bayesian approaches to sequences of dependent variables. At
the moment both approaches yield comparable bounds; however, each approach has its own
potential advantages that can be exploited in future work.

The second problem introduced by weighted sampling is the growing variance of the weighted
sample variables. The martingale bounding techniques used in this work do not give full
control over the variance, which explains the gap between our results and state-of-the-art bounds for
multiarmed bandits (Auer et al., 2002a; Auer and Ortner, 2010). Tighter bounds can be achieved by
combining PAC-Bayesian analysis with a Bernstein-type inequality for martingales (Beygelzimer et al.,
2010). Such a combination will be presented in future work.

The subsequent sections are organized as follows: Section 2 surveys the main results of the paper,
Section 3 presents our bound on expectation of convex functions of sequentially dependent random
variables and illustrates its application to derivation of an alternative to the Hoeffding-Azuma inequality,
Section 4 provides a PAC-Bayesian analysis of the weighted sampling strategy based on the bound
from Section 3, Section 5 provides a PAC-Bayesian analysis of the weighted sampling strategy based
on martingales, Section 6 derives a regret bound for the multiarmed bandit problem, and Section 7
concludes the results.

2 Main Results 

One of the foundation stones of our paper is the following lemma, which makes it possible to bound
expectations of convex functions of certain sequentially dependent random variables by expectations of
the same functions of independent Bernoulli random variables. The lemma generalizes a preceding result of
Maurer (2004) for independent random variables and may be of wide independent interest far
beyond PAC-Bayesian analysis. The lemma can be used to derive an alternative to the
Hoeffding-Azuma inequality (Hoeffding, 1963; Azuma, 1967). This alternative can be much tighter in certain
situations (see our derivation and discussion of Lemma 7 in the next section).

Lemma 1 Let $X_1, \dots, X_N$ be dependent random variables belonging to the $[0,1]$ interval and
distributed by $p(x_i \mid X_1, \dots, X_{i-1})$, such that $\mathbb{E}[X_i \mid X_1, \dots, X_{i-1}] = p$ for all $i$. Let
$Y_1, \dots, Y_N$ be independent Bernoulli random variables, such that $\mathbb{E} Y_i = p$ for all $i$. Then for any
convex function $f : [0,1]^N \to \mathbb{R}$:

$$\mathbb{E} f(X_1, \dots, X_N) \le \mathbb{E} f(Y_1, \dots, Y_N).$$
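
As a sanity check of Lemma 1, the following minimal sketch compares the two expectations exactly on a toy example. The two-step dependent process, the choice p = 0.5, and the two convex test functions are assumptions made purely for illustration and do not come from the paper.

```python
import itertools
import math

# Toy dependent pair (X1, X2) on [0, 1] with E[X1] = E[X2 | X1] = 0.5.
# X2 is deterministic after X1 = 0.2 and a fair coin on {0, 1} after X1 = 0.8,
# so the conditional distribution of X2 depends on X1 while its mean does not.
DEPENDENT_PATHS = [((0.2, 0.5), 0.5), ((0.8, 0.0), 0.25), ((0.8, 1.0), 0.25)]


def expectation_dependent(f):
    """Exact E[f(X1, X2)] for the toy dependent process above."""
    return sum(prob * f(x) for x, prob in DEPENDENT_PATHS)


def expectation_bernoulli(f, n, p):
    """Exact E[f(Y1, ..., Yn)] for independent Bernoulli(p) variables."""
    total = 0.0
    for y in itertools.product((0, 1), repeat=n):
        prob = math.prod(p if yi == 1 else 1.0 - p for yi in y)
        total += prob * f(y)
    return total


def squared_deviation(x):
    """Convex test function: squared deviation of the average from 0.5."""
    return (sum(x) / len(x) - 0.5) ** 2


def exp_sum(x):
    """Convex test function: exponential of twice the sum."""
    return math.exp(2.0 * sum(x))


for f in (squared_deviation, exp_sum):
    print(f.__name__, expectation_dependent(f), expectation_bernoulli(f, 2, 0.5))
```

In both cases the expectation under the dependent process does not exceed the expectation under independent Bernoulli variables, as the lemma predicts.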

We present the subsequent results in the context of the multiarmed bandit problem, which is
probably the most common problem in machine learning where sequentially dependent variables
are encountered. Let $\mathcal{A}$ be a set of actions (arms) of size $|\mathcal{A}| = K$ and let $a \in \mathcal{A}$ denote the actions.
Denote by $R(a)$ the expected reward of action $a$. Let $\pi_t$ be a distribution over $\mathcal{A}$ that is played at
round $t$ of the game. Let $\{A_1, A_2, \dots\}$ be the sequence of actions played independently at random
according to $\{\pi_1, \pi_2, \dots\}$ respectively. Let $\{R_1, R_2, \dots\}$ be the sequence of observed rewards. Denote
by $\mathcal{T}_t = \{\{A_1, \dots, A_t\}, \{R_1, \dots, R_t\}\}$ the set of taken actions and observed rewards up to round $t$ (by
definition $\mathcal{T}_{t-1} \subset \mathcal{T}_t$).

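For $t \ge 1$ and $a \in \{1, \dots, K\}$ define a set of indicator random variables $\{I_t^a\}_{t,a}$:

$$I_t^a = \begin{cases} 1, & \text{if } A_t = a \\ 0, & \text{otherwise.} \end{cases}$$

Define a set of random variables $R_t^a = \frac{1}{\pi_t(a)} I_t^a R_t$. In other words:

$$R_t^a = \begin{cases} \frac{1}{\pi_t(a)} R_t, & \text{if } A_t = a \\ 0, & \text{otherwise.} \end{cases}$$

Define: $\hat R_t(a) = \frac{1}{t} \sum_{\tau=1}^{t} R_\tau^a$. For a distribution $\rho$ over $\mathcal{A}$ define $R(\rho) = \mathbb{E}_{\rho(a)} R(a)$ and
$\hat R_t(\rho) = \mathbb{E}_{\rho(a)} \hat R_t(a)$.

For two distributions $\rho$ and $\mu$, let $KL(\rho\|\mu)$ denote the KL-divergence between $\rho$ and $\mu$. For two
Bernoulli random variables with biases $p$ and $q$ let $kl(p\|q) = p \ln\frac{p}{q} + (1-p) \ln\frac{1-p}{1-q}$ be an abbreviation
for $KL([p, 1-p] \,\|\, [q, 1-q])$.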
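
The definitions above can be sketched in a few lines of code. The function names and the list-based data layout below are illustrative assumptions; the computation itself follows the importance-weighted average and the Bernoulli kl defined in the text.

```python
import math


def importance_weighted_rewards(actions, rewards, probs, num_arms):
    """Importance-weighted empirical reward of every arm after t rounds.

    actions[i] is the arm played in round i, rewards[i] the observed reward,
    and probs[i][a] the probability with which arm a was played in round i.
    Only the played arm receives a nonzero term, scaled by 1 / probability.
    """
    t = len(actions)
    totals = [0.0] * num_arms
    for arm, reward, pi in zip(actions, rewards, probs):
        totals[arm] += reward / pi[arm]
    return [s / t for s in totals]


def kl_bernoulli(p, q):
    """kl(p||q) between Bernoulli biases, with the convention 0 * ln 0 = 0."""
    def term(x, y):
        if x == 0.0:
            return 0.0
        if y == 0.0:
            return math.inf
        return x * math.log(x / y)
    return term(p, q) + term(1.0 - p, 1.0 - q)


uniform = [0.5, 0.5]
estimates = importance_weighted_rewards(
    actions=[0, 1, 0], rewards=[1.0, 0.5, 0.0], probs=[uniform] * 3, num_arms=2
)
print(estimates, kl_bernoulli(0.5, 0.25))
```

Note that the weighted estimates can exceed 1 even when rewards lie in [0, 1], which is why the analysis rescales them by a lower bound on the sampling probabilities.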

We present two alternative results: the first applies Lemma 1 to handle sequences of dependent
random variables and the second is based on a combination of PAC-Bayesian analysis with the
Hoeffding-Azuma inequality. Then we compare the results and present a regret bound for the multiarmed
bandit problem based on the first solution.

2.1 PAC-Bayesian Analysis of Sequentially Dependent Variables Based on Lemma 1

Our first PAC-Bayesian theorem provides a bound on the divergence between $\hat R_t(\rho_t)$ and $R(\rho_t)$ for
any playing strategy $\rho_t$ throughout the game.

Theorem 2 For any sequence of sampling distributions $\{\pi_1, \pi_2, \dots\}$ that are not zero for any $a \in \mathcal{A}$,
where $\pi_t$ can depend on $\mathcal{T}_{t-1}$, and for any sequence of "reference" ("prior") distributions $\{\mu_1, \mu_2, \dots\}$
over $\mathcal{A}$, such that $\mu_t$ is independent of $\mathcal{T}_t$ (but can depend on $t$), for all possible distributions $\rho_t$ given
$t$ and for all $t \ge 1$ simultaneously with probability greater than $1 - \delta$:

$$kl\left(\pi_t^{lmin} \hat R_t(\rho_t) \,\middle\|\, \pi_t^{lmin} R(\rho_t)\right) \le \frac{KL(\rho_t\|\mu_t) + 3\ln(t+1) - \ln\delta}{t}, \tag{1}$$

where

$$\pi_t^{lmin} \le \min_{a,\; 1 \le \tau \le t} \pi_\tau(a).$$

The number $\pi_t^{lmin}$ lower bounds the sampling probabilities for all the actions up to time $t$ (lmin
stands for "left minimum", the minimum of $\pi_\tau(a)$ up to ["left of"] time $t$).

The KL-divergence $kl(p\|q)$ bounds the absolute difference between $p$ and $q$ as

$$|p - q| \le \sqrt{kl(p\|q)/2} \tag{2}$$

(Cover and Thomas, 1991). Combined with (1) this relation yields (with probability greater than
$1 - \delta$):

$$\left|R(\rho_t) - \hat R_t(\rho_t)\right| \le \frac{1}{\pi_t^{lmin}} \sqrt{\frac{KL(\rho_t\|\mu_t) + 3\ln(t+1) - \ln\delta}{2t}}. \tag{3}$$
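
To illustrate the difference between a kl-form bound and its relaxation through the square-root relation between kl and the absolute difference, the sketch below inverts the kl numerically by bisection and compares the result with the relaxed upper bound. The helper names, the tolerance, and the sample numbers are assumptions for illustration; in the theorem the kl is applied to empirical and expected rewards scaled by the lower bound on the sampling probabilities.

```python
import math


def kl_bernoulli(p, q):
    """kl(p||q) between Bernoulli biases, with the convention 0 * ln 0 = 0."""
    def term(x, y):
        if x == 0.0:
            return 0.0
        if y == 0.0:
            return math.inf
        return x * math.log(x / y)
    return term(p, q) + term(1.0 - p, 1.0 - q)


def kl_upper_inverse(p_hat, budget, tol=1e-12):
    """Largest q in [p_hat, 1] with kl(p_hat||q) <= budget, found by bisection.

    kl(p_hat||q) is increasing in q on [p_hat, 1], so bisection applies.
    """
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl_bernoulli(p_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def relaxed_upper(p_hat, budget):
    """Upper bound on q from |p - q| <= sqrt(kl(p||q) / 2), capped at 1."""
    return min(1.0, p_hat + math.sqrt(budget / 2.0))


p_hat, budget = 0.05, 0.1
print(kl_upper_inverse(p_hat, budget), relaxed_upper(p_hat, budget))
```

When the empirical value is close to the edge of the interval, as in this example, the inverted kl gives a noticeably smaller upper bound than the relaxed form.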

2.2 Combination of PAC-Bayesian Analysis with Hoeffding-Azuma Inequality

The result presented next is based on a combination of PAC-Bayesian analysis with the Hoeffding-Azuma
inequality. We introduce one more definition:

$$\hat R_t^w(a) = \frac{\sum_{\tau=1}^{t} w_\tau^t R_\tau^a}{\sum_{\tau=1}^{t} w_\tau^t},$$

where $w_\tau^t \ge 0$ for all $t$ and $\tau$ and $\sum_{\tau=1}^{t} w_\tau^t > 0$ for all $t$. $\hat R_t^w(a)$ is a weighted sum of the samples.
For the special case where $w_\tau^t = \frac{1}{t}$ for all $\tau$, $\hat R_t^w(a) = \hat R_t(a)$.

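Theorem 3 For any sequence of sampling distributions $\{\pi_1, \pi_2, \dots\}$ that are not zero for any $a \in \mathcal{A}$,
where $\pi_t$ can depend on $\mathcal{T}_{t-1}$, and for any sequence of "reference" ("prior") distributions $\{\mu_1, \mu_2, \dots\}$
over $\mathcal{A}$, such that $\mu_t$ is independent of $\mathcal{T}_t$ (but can depend on $t$), for any sequence of positive
parameters $\{\lambda_1, \lambda_2, \dots\}$ and for any sequence of weighting vectors $\{w^1, w^2, \dots\}$, such that $\lambda_t$ and $w^t$ are
independent of $\mathcal{T}_t$ (but can depend on $t$), for all possible distributions $\rho_t$ given $t$ and for all $t \ge 1$
simultaneously with probability greater than $1 - \delta$:

$$\left|\hat R_t^w(\rho_t) - R(\rho_t)\right| \le \frac{KL(\rho_t\|\mu_t) + \frac{\lambda_t^2}{2} \sum_{\tau=1}^{t} \left(\frac{w_\tau^t}{\pi_\tau^{min}}\right)^2 + 2\ln(t+1) + \ln\frac{2}{\delta}}{\lambda_t \sum_{\tau=1}^{t} w_\tau^t}, \tag{4}$$

where

$$\pi_t^{min} \le \min_a \pi_t(a).$$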



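For the special case $w_\tau^t = \frac{1}{t}$ we obtain that with probability greater than $1 - \delta$:

$$\left|\hat R_t(\rho_t) - R(\rho_t)\right| \le \frac{KL(\rho_t\|\mu_t) + 2\ln(t+1) + \ln\frac{2}{\delta}}{\lambda_t} + \frac{\lambda_t}{2t^2} \sum_{\tau=1}^{t} \frac{1}{(\pi_\tau^{min})^2}. \tag{5}$$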

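By taking

$$\lambda_t = \sqrt{\frac{2\left(2\ln(t+1) + \ln\frac{2}{\delta}\right)}{\frac{1}{t^2} \sum_{\tau=1}^{t} \frac{1}{(\pi_\tau^{min})^2}}}$$

we obtain:

$$\left|\hat R_t(\rho_t) - R(\rho_t)\right| \le \left(\frac{KL(\rho_t\|\mu_t)}{2\ln(t+1) + \ln\frac{2}{\delta}} + 2\right) \sqrt{\frac{2\ln(t+1) + \ln\frac{2}{\delta}}{2t} \cdot \frac{1}{t} \sum_{\tau=1}^{t} \frac{1}{(\pi_\tau^{min})^2}}. \tag{6}$$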



2.3 Comparison of Theorem 2 with Theorem 3

It is interesting to compare Theorems 2 and 3 resulting from the two different approaches. Inequality
(3) depends on $\frac{1}{\pi_t^{lmin}} = \max_{1 \le \tau \le t} \left\{\frac{1}{\pi_\tau^{min}}\right\}$, whereas (6) depends on $\sqrt{\frac{1}{t} \sum_{\tau=1}^{t} \frac{1}{(\pi_\tau^{min})^2}}$. If $\pi_\tau^{min}$ are
approximately equal for all $\tau$, then the two terms are approximately identical. However, a single
small value of $\pi_\tau^{min}$ can increase the value of $\frac{1}{\pi_t^{lmin}}$ significantly for all $t \ge \tau$, while its relative
contribution to the average of $\frac{1}{(\pi_\tau^{min})^2}$ will decrease with time. This property provides an advantage
to Theorem 3. On the other hand, the stronger kl form (1) of Theorem 2 can potentially be an
advantage for the bound based on Lemma 1, but we did not exploit it in this work.

Since for our choice of sampling strategy $\frac{1}{\pi_t^{lmin}} \approx \sqrt{\frac{1}{t} \sum_{\tau=1}^{t} \frac{1}{(\pi_\tau^{min})^2}}$ up to small constants, we
present a regret bound based on Theorem 2 only. A regret bound based on Theorem 3 can be derived
in a similar way and is identical to the bound presented below up to small constants.
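
The comparison above can be illustrated numerically. The sketch below computes the two quantities, the maximum of the inverse lower bounds on the sampling probabilities and the root of the average of their squares, for two hypothetical sequences. The function names, the sequences, and the decay schedule proportional to the inverse fourth root of the round index are assumptions chosen for illustration.

```python
import math


def left_min_factor(pi_min):
    """Maximum over rounds of 1 / pi_min, i.e. 1 over the left minimum."""
    return max(1.0 / p for p in pi_min)


def root_mean_square_factor(pi_min):
    """Square root of the average of 1 / pi_min^2 over rounds."""
    return math.sqrt(sum(1.0 / p ** 2 for p in pi_min) / len(pi_min))


# A single small sampling probability followed by 99 moderate ones.
spiky = [0.01] + [0.25] * 99
print(left_min_factor(spiky), root_mean_square_factor(spiky))

# A slowly decaying schedule proportional to tau^(-1/4); K = 4 is hypothetical.
K, t = 4, 10000
decaying = [K ** -0.25 * tau ** -0.25 for tau in range(1, t + 1)]
ratio = root_mean_square_factor(decaying) / left_min_factor(decaying)
print(ratio)
```

For the spiky sequence the first quantity stays at 100 while the second drops to about 10.8; for the decaying schedule the two agree up to a constant close to the square root of 2/3.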

2.4 Regret Bound for Multiarmed Bandits 

We applied Theorem 2 to derive the following regret bound for the multiarmed bandit problem.

Theorem 4 For $t < K^3$ let $\pi_t(a) = \frac{1}{K}$ for all $a$. Let $\gamma_t = K^{1/4} t^{1/4}$ and $\varepsilon_t = K^{-1/4} t^{-1/4}$ and for
$t \ge (K^3 - 1)$ let

$$\pi_{t+1}(a) = \tilde\rho_t^{exp}(a) = (1 - K\varepsilon_{t+1})\, \rho_t^{exp}(a) + \varepsilon_{t+1}, \tag{7}$$

where

$$\rho_t^{exp}(a) = \frac{1}{Z(\rho_t^{exp})} e^{\gamma_t \hat R_t(a)} \tag{8}$$

and

$$Z(\rho_t^{exp}) = \sum_a e^{\gamma_t \hat R_t(a)}.$$

Then for $t \ge K^3$ the per-round regret $R(a^*) - R(\tilde\rho_t^{exp})$ (where $a^*$ is the best action) is bounded by:

$$R(a^*) - R(\tilde\rho_t^{exp}) \le \frac{K^{3/4}}{(t+1)^{1/4}} \left(2.5 + \sqrt{\frac{\ln(K) + 3\ln(t+1) - \ln\delta}{2K}} + \sqrt{\frac{3\ln(t+1) - \ln\delta}{2K}}\right)$$

with probability greater than $1 - \delta$ for all rounds $t$ simultaneously. This translates into a total regret
of $\tilde O(K^{3/4} t^{3/4})$ (where $\tilde O$ hides logarithmic factors).

Note that $\varepsilon_t$ bounds $\pi_t(a)$ from below for all $a$ and $t \ge K^3$. Furthermore, since $\varepsilon_t$ is a decreasing
sequence it actually bounds $\pi_\tau(a)$ from below for all $a$ and $\tau \le t$. Hence, for the prediction strategy
selected in Theorem 4 and for $t \ge K^3$ we can substitute $\pi_t^{lmin}$ with $\varepsilon_t$ in (1) and (3).
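
A minimal sketch of the sampling strategy of Theorem 4 follows: a Gibbs distribution over the arms built from the importance-weighted empirical rewards, mixed with a uniform exploration floor. The function names are hypothetical, and the exponents in the schedule are an assumption read off from the extracted text (a learning rate of K^(1/4) t^(1/4) and an exploration floor of K^(-1/4) t^(-1/4)), so they should be checked against the original paper.

```python
import math


def schedule(num_arms, t):
    """Learning rate gamma_t and exploration floor eps_t (assumed exponents)."""
    gamma = num_arms ** 0.25 * t ** 0.25
    eps = num_arms ** -0.25 * t ** -0.25
    return gamma, eps


def smoothed_gibbs_policy(emp_rewards, gamma, eps):
    """(1 - K * eps) * Gibbs(gamma * emp_rewards) + eps, for K arms.

    The maximum is subtracted before exponentiating for numerical stability;
    this does not change the normalized Gibbs distribution.
    """
    num_arms = len(emp_rewards)
    top = max(emp_rewards)
    weights = [math.exp(gamma * (r - top)) for r in emp_rewards]
    norm = sum(weights)
    return [(1.0 - num_arms * eps) * w / norm + eps for w in weights]


emp_rewards = [0.2, 0.9, 0.5]  # hypothetical importance-weighted estimates
gamma, eps = schedule(num_arms=3, t=81)
policy = smoothed_gibbs_policy(emp_rewards, gamma, eps)
print(policy, eps)
```

Every arm keeps probability at least the exploration floor, which is what allows the floor to play the role of the lower bound on the sampling probabilities in the analysis. At t = K^3 the floor equals 1/K and the policy coincides with the uniform distribution used in the earlier rounds.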

3 Proof of Lemma 1 and an Example of its Application

We start with the proof of Lemma 1 and then illustrate how it can be applied to martingales.

Proof of Lemma 1 The proof follows the lines of the proof of Lemma 3 in Maurer (2004).
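Any point $x = (x_1, \dots, x_N) \in [0,1]^N$ can be written as a convex combination of the extreme points
$\eta = (\eta_1, \dots, \eta_N) \in \{0,1\}^N$ in the following way:

$$x = \sum_{\eta \in \{0,1\}^N} \left(\prod_{i:\eta_i=0} (1 - x_i) \prod_{i:\eta_i=1} x_i\right) \eta.$$

Convexity of $f$ therefore implies

$$f(x) \le \sum_{\eta \in \{0,1\}^N} \left(\prod_{i:\eta_i=0} (1 - x_i) \prod_{i:\eta_i=1} x_i\right) f(\eta) \tag{9}$$

with equality if $x \in \{0,1\}^N$. At the next step Maurer uses independence of the $X_i$-s, whereas we
use the fact that their conditional expectation is constant. Taking expectation of both sides of (9)
we obtain:

$$\begin{aligned}
\mathbb{E}_{X_1,\dots,X_N}[f(X)] &\le \mathbb{E}_{X_1,\dots,X_N}\left[\sum_{\eta \in \{0,1\}^N} \left(\prod_{i:\eta_i=0} (1 - X_i) \prod_{i:\eta_i=1} X_i\right) f(\eta)\right] \\
&= \sum_{\eta \in \{0,1\}^N} \mathbb{E}_{X_1,\dots,X_{N-1}}\left[\left(\prod_{i:\eta_i=0,\, i<N} (1 - X_i) \prod_{i:\eta_i=1,\, i<N} X_i\right) \cdot \mathbb{E}_{X_N}\left[(1 - \eta_N)(1 - X_N) + \eta_N X_N \mid X_1, \dots, X_{N-1}\right]\right] f(\eta) \\
&= \sum_{\eta \in \{0,1\}^N} \mathbb{E}_{X_1,\dots,X_{N-1}}\left[\prod_{i:\eta_i=0,\, i<N} (1 - X_i) \prod_{i:\eta_i=1,\, i<N} X_i\right] \cdot \left[(1 - \eta_N)(1 - p) + \eta_N p\right] f(\eta) \\
&= \dots \qquad (10) \\
&= \sum_{\eta \in \{0,1\}^N} \left(\prod_{i:\eta_i=0} (1 - p) \prod_{i:\eta_i=1} p\right) f(\eta) \\
&= \mathbb{E}_{Y_1,\dots,Y_N}[f(Y)].
\end{aligned}$$

In (10) we apply induction in order to replace the $X_i$-s by $p$, one by one from the last to the first, in the
same way as we did for $X_N$. ■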

3.1 Application to Martingales

We apply Lemma 1 to derive an alternative to the Hoeffding-Azuma inequality. The derivation is based
on Markov's inequality and a concentration result for independent Bernoulli variables provided
below.

Lemma 5 (Markov's inequality) For a random variable $X \ge 0$ with probability greater than
$1 - \delta$:

$$X \le \frac{1}{\delta} \mathbb{E} X. \tag{11}$$

The concentration result for independent Bernoulli variables is based on the method of types
in information theory (Cover and Thomas, 1991). Its proof can be found in Seeger (2003), Banerjee
(2006), or Seldin and Tishby (2010).^1



Lemma 6 Let $X_1, \dots, X_N$ be i.i.d. Bernoulli random variables. Let $\hat S = \frac{1}{N} \sum_{i=1}^{N} X_i$ be their
empirical average and $S = \mathbb{E} X_i$ the expected value. Then:

$$\mathbb{E}_{X_1,\dots,X_N}\left[e^{N\, kl(\hat S\|S)}\right] \le N + 1. \tag{12}$$

Since the KL-divergence is a convex function (Cover and Thomas, 1991) and the exponent is convex and
non-decreasing, $e^{N\, kl(\hat S\|S)}$ is also a convex function. Therefore, by Lemma 1 we obtain that Lemma
6 also holds for $X_1, \dots, X_N$ that belong to the $[0,1]$ interval and are sequentially dependent on each
other as long as their conditional expectation $\mathbb{E}[X_i \mid X_1, \dots, X_{i-1}]$ is identical.
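
For i.i.d. Bernoulli variables the expectation in Lemma 6 can be computed exactly by summing over the binomial distribution of the number of successes. The following sketch does so for a few hypothetical values of N and of the bias; the function names are illustrative.

```python
import math


def kl_bernoulli(p, q):
    """kl(p||q) between Bernoulli biases, with the convention 0 * ln 0 = 0."""
    def term(x, y):
        if x == 0.0:
            return 0.0
        if y == 0.0:
            return math.inf
        return x * math.log(x / y)
    return term(p, q) + term(1.0 - p, 1.0 - q)


def expected_exp_kl(n, bias):
    """Exact E[exp(n * kl(S_hat || bias))] for n i.i.d. Bernoulli(bias) draws."""
    total = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * bias ** k * (1.0 - bias) ** (n - k)
        total += prob * math.exp(n * kl_bernoulli(k / n, bias))
    return total


for n in (1, 5, 50):
    for bias in (0.1, 0.5):
        print(n, bias, expected_exp_kl(n, bias), n + 1)
```

The computed values stay below N + 1 and do not depend on the bias, because each term of the sum simplifies to a binomial probability evaluated at its own maximum-likelihood parameter.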

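Alternative to Hoeffding-Azuma Inequality Based on Lemmas 1 and 6

Now we are ready to present our alternative to the Hoeffding-Azuma inequality.

Lemma 7 Let $X_1, \dots, X_N$ be a martingale difference sequence (meaning that $\mathbb{E}[X_i \mid X_1, \dots, X_{i-1}] = 0$),
such that $X_i \in [a_i, b_i]$ for arbitrary $a_i \le 0$ and $b_i \ge 0$. Let $S_1, \dots, S_N$ be a martingale, where
$S_j = \sum_{i=1}^{j} X_i$. Let $a = \min_i a_i$ and $b = \max_i b_i$ and let $Z_i = (X_i - a)/(b - a)$. Then with probability
greater than $1 - \delta$ the following holds simultaneously:

$$kl\left(\frac{1}{N} \sum_{i=1}^{N} Z_i \,\middle\|\, \frac{-a}{b - a}\right) \le \frac{1}{N} \ln\frac{N+1}{\delta}, \tag{13}$$

$$|S_N| \le (b - a) \sqrt{\frac{N}{2} \ln\frac{N+1}{\delta}}. \tag{14}$$

Proof of Lemma 7 By definition of $Z_i$ we have $Z_i \in [0,1]$ and $\mathbb{E}[Z_i \mid Z_1, \dots, Z_{i-1}] = \frac{-a}{b-a}$ is
identical for all $Z_i$. Hence, by Markov's inequality and the combination of Lemma 1 with Lemma 6, with
probability greater than $1 - \delta$:

$$e^{N\, kl\left(\frac{1}{N} \sum_{i=1}^{N} Z_i \,\middle\|\, \frac{-a}{b-a}\right)} \le \frac{N+1}{\delta}.$$

Taking the logarithm and normalizing by $N$ yields (13).

By relation (2) between the $L_1$-norm and the KL-divergence, (13) yields:

$$\left|\frac{1}{N} \sum_{i=1}^{N} Z_i + \frac{a}{b - a}\right| \le \sqrt{\frac{\ln\frac{N+1}{\delta}}{2N}}.$$

From the definitions, $X_i = (b - a) Z_i + a$ and $S_N = (b - a) \sum_{i=1}^{N} Z_i + Na$. Simple algebraic manipulations
yield (14). ■

^1 It is possible to prove an even stronger result of the form $\sqrt{N} \le \mathbb{E}_{X_1,\dots,X_N} e^{N\, kl(\hat S\|S)} \le 2\sqrt{N}$ for $N \ge 8$ using
Stirling's approximation of the factorial (Maurer, 2004). For simplicity we use (12).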

Comparison with Hoeffding-Azuma Inequality

It is instructive to compare Lemma 7 with the Hoeffding-Azuma inequality, which we cite below for the
comparison (Azuma, 1967; Cesa-Bianchi and Lugosi, 2006).

Lemma 8 (Hoeffding-Azuma Inequality) Let $S_1, \dots, S_N$ be a zero-mean martingale satisfying
$S_i - S_{i-1} \in [a_i, b_i]$, then for any $\lambda > 0$:

$$\mathbb{E}\left[e^{\lambda S_N}\right] \le e^{(\lambda^2/8) \sum_{i=1}^{N} (b_i - a_i)^2}.$$

It is easy to verify, using the same procedure we applied before, that Lemma 8 implies that with
probability greater than $1 - \delta$:

$$|S_N| \le \frac{\lambda}{8} \sum_{i=1}^{N} (b_i - a_i)^2 + \frac{1}{\lambda} \ln\frac{2}{\delta}$$

and that the above expression is minimized by $\lambda = \sqrt{8 \ln\frac{2}{\delta} \Big/ \sum_{i=1}^{N} (b_i - a_i)^2}$, yielding:

$$|S_N| \le \sqrt{\frac{1}{2} \sum_{i=1}^{N} (b_i - a_i)^2 \ln\frac{2}{\delta}}. \tag{15}$$

In the special case where $a_i = a$ for all $i$ and $b_i = b$ for all $i$, this further simplifies to:

$$|S_N| \le (b - a) \sqrt{\frac{N}{2} \ln\frac{2}{\delta}}.$$

Now we are ready to make the comparison. If the $a_i$-s and $b_i$-s are equal (or almost equal) for all
$i$, inequality (14) matches the Hoeffding-Azuma inequality up to the $\ln(N+1)$ factor (which can also be
halved by using a tighter bound in (12)). If the $a_i$-s and $b_i$-s are not identical, inequality (14) can be
potentially much worse, since a single large $(b_i - a_i)$ term will permanently increase $(b - a)$, but
its relative contribution to (15) will decrease with the increase of $N$. However, when the empirical
average is close to the lower or upper limit of the domain interval, the kl form of Lemma 7 in equation
(13) is much tighter than the relaxed $L_1$-norm form in equation (14) (McAllester, 2003). Therefore,
in situations where the analysis can be carried out using the kl form of the bound, it might be
preferable.
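
The comparison can be made concrete with a small numerical sketch. It evaluates the relaxed form of the bound of Lemma 7, which uses the widest range among all increments, against the Hoeffding-Azuma bound, which sums the squared individual ranges. The constants inside the square roots follow from combining the kl bound with the square-root relaxation and from optimizing the free parameter in the Hoeffding-Azuma bound; the function names and the example ranges are assumptions.

```python
import math


def lemma7_relaxed_bound(ranges, delta):
    """(b - a) * sqrt(N / 2 * ln((N + 1) / delta)), a = min a_i, b = max b_i."""
    n = len(ranges)
    a = min(lo for lo, _ in ranges)
    b = max(hi for _, hi in ranges)
    return (b - a) * math.sqrt(n / 2.0 * math.log((n + 1) / delta))


def hoeffding_azuma_bound(ranges, delta):
    """sqrt(1/2 * ln(2 / delta) * sum_i (b_i - a_i)^2), two-sided."""
    total = sum((hi - lo) ** 2 for lo, hi in ranges)
    return math.sqrt(0.5 * math.log(2.0 / delta) * total)


delta = 0.05
equal = [(-1.0, 1.0)] * 100
one_wide = [(-10.0, 10.0)] + [(-1.0, 1.0)] * 99

print(lemma7_relaxed_bound(equal, delta), hoeffding_azuma_bound(equal, delta))
print(lemma7_relaxed_bound(one_wide, delta), hoeffding_azuma_bound(one_wide, delta))
```

With equal ranges the two bounds differ only through the logarithmic factor, while a single wide increment inflates the first bound by the full ratio of the ranges and the second only mildly.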

4 Proof of Theorem 2 (PAC-Bayesian Bound Based on Lemma 1)

Our proof uses the following lemma, which has been at the basis of PAC-Bayesian analysis since its
inception and has its roots in information theory and statistical physics (Donsker and Varadhan,
1975; Dupuis and Ellis, 1997; Gray, 2011; Banerjee, 2006). The lemma allows us to relate all posterior
distributions $\rho$ to a single prior distribution $\mu$.

Lemma 9 For any measurable function $\phi(h)$ on $\mathcal{H}$ and any distributions $\mu(h)$ and $\rho(h)$ on $\mathcal{H}$, we
have:

$$\mathbb{E}_{\rho(h)}[\phi(h)] \le KL(\rho\|\mu) + \ln \mathbb{E}_{\mu(h)}\left[e^{\phi(h)}\right]. \tag{16}$$

Proof of Theorem 2 First, we show that $R(a) = \mathbb{E}_{\mathcal{T}_t}[\hat R_t(a)]$. Let $p(r|a)$ be the distribution of the
reward for playing arm $a$ and let $R^a$ be a random variable distributed according to $p(r|a)$. Then for
any $t$:

$$\begin{aligned}
R(a) = \mathbb{E}_{p(r|a)}[R^a] &= \mathbb{E}_{p(r|a)}\left[\frac{1}{\pi_t(a)} \pi_t(a) R^a\right] = \mathbb{E}_{p(r|a),\pi_t(a)}\left[\frac{1}{\pi_t(a)} I_t^a R^a\right] \\
&= \mathbb{E}_{p(r|a),\pi_t(a)}\left[\frac{1}{\pi_t(a)} I_t^a R_t\right] = \mathbb{E}_{p(r|a),\pi_t(a)}[R_t^a], \qquad (17)
\end{aligned}$$

where (17) holds since if $I_t^a = 1$, then $R_t$ is distributed by $p(r|a)$, and otherwise $R_t$ is irrelevant.
Hence, we obtain that $\mathbb{E}_{\mathcal{T}_t}[\hat R_t(a)] = \mathbb{E}_{\mathcal{T}_t}\left[\frac{1}{t} \sum_{\tau=1}^{t} R_\tau^a\right] = R(a)$ for all $a$ and $t$.

Note that $\hat R_t(a)$ is an average of $t$ random variables belonging to the $\left[0, \frac{1}{\pi_t^{lmin}}\right]$ interval. By scaling
$R(a)$ and $\hat R_t(a)$ by a factor of $\pi_t^{lmin}$ we scale the random variables to the $[0,1]$ interval, where
Lemmas 1 and 6 can be applied.
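We apply PAC-Bayesian analysis to the scaled version of $R(a)$ and $\hat R_t(a)$ for a fixed $t$:

$$\begin{aligned}
t \cdot kl\left(\pi_t^{lmin} \hat R_t(\rho_t) \,\middle\|\, \pi_t^{lmin} R(\rho_t)\right) &= t \cdot kl\left(\mathbb{E}_{\rho_t(a)}[\pi_t^{lmin} \hat R_t(a)] \,\middle\|\, \mathbb{E}_{\rho_t(a)}[\pi_t^{lmin} R(a)]\right) \\
&\le \mathbb{E}_{\rho_t(a)}\left[t \cdot kl\left(\pi_t^{lmin} \hat R_t(a) \,\middle\|\, \pi_t^{lmin} R(a)\right)\right] \qquad (18) \\
&\le KL(\rho_t\|\mu_t) + \ln \mathbb{E}_{\mu_t(a)}\left[e^{t \cdot kl(\pi_t^{lmin} \hat R_t(a) \| \pi_t^{lmin} R(a))}\right], \qquad (19)
\end{aligned}$$

where (18) is due to convexity of kl and (19) is by Lemma 9.

The second term in (19) can be bounded with high probability:

$$\begin{aligned}
\mathbb{E}_{\mu_t(a)}\left[e^{t \cdot kl(\pi_t^{lmin} \hat R_t(a) \| \pi_t^{lmin} R(a))}\right] &\le \frac{1}{\delta_t} \mathbb{E}_{\mathcal{T}_t} \mathbb{E}_{\mu_t(a)}\left[e^{t \cdot kl(\pi_t^{lmin} \hat R_t(a) \| \pi_t^{lmin} R(a))}\right] \qquad (20) \\
&= \frac{1}{\delta_t} \mathbb{E}_{\mu_t(a)} \mathbb{E}_{\mathcal{T}_t}\left[e^{t \cdot kl(\pi_t^{lmin} \hat R_t(a) \| \pi_t^{lmin} R(a))}\right] \qquad (21) \\
&\le \frac{1}{\delta_t} (t + 1), \qquad (22)
\end{aligned}$$

where (20) holds with probability greater than $1 - \delta_t$ by Markov's inequality (Lemma 5), the
interchange of expectations in (21) is possible since $\mu_t$ is independent of $\mathcal{T}_t$, and (22) is by Lemma 1 and
Lemma 6. Substitution of (22) into (19) yields with probability greater than $1 - \delta_t$:

$$kl\left(\pi_t^{lmin} \hat R_t(\rho_t) \,\middle\|\, \pi_t^{lmin} R(\rho_t)\right) \le \frac{KL(\rho_t\|\mu_t) + \ln\frac{t+1}{\delta_t}}{t}.$$

Finally, by setting $\delta_t = \frac{\delta}{t(t+1)} \ge \frac{\delta}{(t+1)^2}$ and applying the union bound we obtain (1) for all $t$
simultaneously (it is well-known that $\sum_{t=1}^{\infty} \frac{1}{t(t+1)} = \sum_{t=1}^{\infty} \left(\frac{1}{t} - \frac{1}{t+1}\right) = 1$). ■

The key ingredient that made the proof of Theorem 2 possible was Lemma 1, which enabled us
to bound $\mathbb{E}_{\mathcal{T}_t}\left[e^{t \cdot kl(\pi_t^{lmin} \hat R_t(a) \| \pi_t^{lmin} R(a))}\right]$ even though the variables $\{R_1^a, \dots, R_t^a\}$ are dependent.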
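
The allocation of the confidence budget across rounds used in the union bound can be checked with exact rational arithmetic. The sketch below verifies the telescoping sum and the logarithmic term that this allocation produces; the function names and the value of the confidence parameter are illustrative.

```python
import math
from fractions import Fraction


def budget_share(t):
    """Fraction of the total confidence budget spent on round t: 1 / (t (t + 1))."""
    return Fraction(1, t * (t + 1))


def partial_budget(horizon):
    """Exact sum of the shares over rounds 1..horizon; telescopes to 1 - 1/(horizon + 1)."""
    return sum(budget_share(t) for t in range(1, horizon + 1))


def log_term(t, delta):
    """ln((t + 1) / delta_t) with delta_t = delta / (t (t + 1))."""
    return math.log((t + 1) * t * (t + 1) / delta)


print(partial_budget(1000))
print(log_term(10, 0.05), 3 * math.log(11) - math.log(0.05))
```

The partial sums approach 1 from below, so the failure probabilities over all rounds add up to at most the total budget, and the logarithmic term is at most three times the logarithm of t + 1 minus the logarithm of the confidence parameter.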

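5 Proof of Theorem 3 (PAC-Bayesian Analysis Based on
Hoeffding-Azuma Inequality)

In this section we provide an alternative PAC-Bayesian bound for $|\hat R_t^w(\rho_t) - R(\rho_t)|$ by using
the Hoeffding-Azuma inequality.

Proof of Theorem 3 Let

$$M_t^i(a) = \sum_{\tau=1}^{i} w_\tau^t \left(R_\tau^a - R(a)\right).$$

Observe that $M_t^1(a), \dots, M_t^t(a)$ is a martingale [since $\mathbb{E}[M_t^i(a) \mid \mathcal{T}_{i-1}] = M_t^{i-1}(a)$] and
$M_t^t(a) = \left(\sum_{\tau=1}^{t} w_\tau^t\right) \left(\hat R_t^w(a) - R(a)\right)$. Note that $(M_t^\tau - M_t^{\tau-1}) \in \left[-\frac{w_\tau^t}{\pi_\tau^{min}}, \frac{w_\tau^t}{\pi_\tau^{min}}\right]$ and $\mathbb{E} M_t^i = 0$.
Hence, by the Hoeffding-Azuma inequality (Lemma 8), for all $a$:

$$\mathbb{E}_{\mathcal{T}_t}\left[e^{\lambda_t \left(\sum_{\tau=1}^{t} w_\tau^t\right) \left(\hat R_t^w(a) - R(a)\right)}\right] = \mathbb{E}_{\mathcal{T}_t}\left[e^{\lambda_t M_t^t(a)}\right] \le e^{\frac{\lambda_t^2}{2} \sum_{\tau=1}^{t} \left(\frac{w_\tau^t}{\pi_\tau^{min}}\right)^2}.$$

By going back to the proof of Theorem 2, replacing $t \cdot kl(\pi_t^{lmin} \hat R_t(a) \| \pi_t^{lmin} R(a))$ with
$\lambda_t \left(\sum_{\tau=1}^{t} w_\tau^t\right) (\hat R_t^w(a) - R(a))$, and substituting the bound on $\mathbb{E}_{\mathcal{T}_t}\left[e^{t \cdot kl(\pi_t^{lmin} \hat R_t(a) \| \pi_t^{lmin} R(a))}\right]$ with the bound on
$\mathbb{E}_{\mathcal{T}_t}\left[e^{\lambda_t \left(\sum_{\tau=1}^{t} w_\tau^t\right) (\hat R_t^w(a) - R(a))}\right]$ we derived above, we obtain that with probability greater than $1 - \frac{\delta}{2}$
for all $\rho_t$

$$\hat R_t^w(\rho_t) - R(\rho_t) \le \frac{KL(\rho_t\|\mu_t) + \frac{\lambda_t^2}{2} \sum_{\tau=1}^{t} \left(\frac{w_\tau^t}{\pi_\tau^{min}}\right)^2 + 2\ln(t+1) + \ln\frac{2}{\delta}}{\lambda_t \sum_{\tau=1}^{t} w_\tau^t}$$

and, by a symmetric argument applied to $-M_t^1(a), \dots, -M_t^t(a)$, with probability greater than $1 - \frac{\delta}{2}$

$$R(\rho_t) - \hat R_t^w(\rho_t) \le \frac{KL(\rho_t\|\mu_t) + \frac{\lambda_t^2}{2} \sum_{\tau=1}^{t} \left(\frac{w_\tau^t}{\pi_\tau^{min}}\right)^2 + 2\ln(t+1) + \ln\frac{2}{\delta}}{\lambda_t \sum_{\tau=1}^{t} w_\tau^t}.$$

Hence, both hold simultaneously with probability greater than $1 - \delta$ and yield (4). ■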

6 Proof of Theorem 4 (The Regret Bound)

In this section we derive a regret bound based on Theorem 2. We then discuss some possible ways
to tighten the regret bound.

The regret bound is derived for the special kind of posterior distribution $\tilde\rho_t^{exp}$ defined in (7) in
Theorem 4, which is used as the sampling distribution $\pi_{t+1}$ for the next round of the game, as described
in the theorem. Furthermore, we define a special kind of prior distribution $\mu_t^{exp}$ as:

$$\mu_t^{exp}(a) = \frac{1}{Z(\mu_t^{exp})} e^{\gamma_t R(a)}. \tag{23}$$

The prior depends on the true expected rewards $R(a)$, but not on the sample and hence it is a
legal prior.

Proof of Theorem 4 Let $a^*$ be the action with the highest reward. The expected regret of the
prediction strategy $\tilde\rho_t^{exp}$ at step $t + 1$ can be written as follows:

$$R(a^*) - R(\tilde\rho_t^{exp}) = [R(a^*) - \hat R_t(a^*)] + [\hat R_t(a^*) - \hat R_t(\rho_t^{exp})] + [\hat R_t(\rho_t^{exp}) - R(\rho_t^{exp})] + [R(\rho_t^{exp}) - R(\tilde\rho_t^{exp})]. \tag{24}$$

We bound the terms in (24) one by one.

$R(a^*)$ and $\hat R_t(a^*)$ are the expected and the empirical rewards of a prediction strategy which is
a delta distribution on $a^*$. Hence, by Theorem 2:

$$R(a^*) - \hat R_t(a^*) \le \frac{1}{\varepsilon_t} \sqrt{\frac{\ln K + 3\ln(t+1) - \ln\delta}{2t}}, \tag{25}$$

where in (25) we used the fact that $R(a^*) \ge R(a)$ for all $a$ and hence $e^{\gamma_t R(a^*)} \ge \frac{1}{K} \sum_a e^{\gamma_t R(a)}$.
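For the second term in (24) we write:

$$\begin{aligned}
\hat R_t(a^*) - \hat R_t(\rho_t^{exp}) &= \sum_a \left(\hat R_t(a^*) - \hat R_t(a)\right) \rho_t^{exp}(a) \\
&= \sum_a \left(\hat R_t(a^*) - \hat R_t(a)\right) \frac{e^{\gamma_t \hat R_t(a)}}{Z(\rho_t^{exp})} \\
&= \frac{\sum_a \left(\hat R_t(a^*) - \hat R_t(a)\right) e^{-\gamma_t (\hat R_t(a^*) - \hat R_t(a))}}{\sum_{a'} e^{-\gamma_t (\hat R_t(a^*) - \hat R_t(a'))}} \\
&\le \frac{K}{\gamma_t}, \qquad (26)
\end{aligned}$$

where (26) follows from the technical lemma below. The proof of the lemma is provided at the
end of this section.

Lemma 10 Let $x_1 = 0$ and $x_2, \dots, x_n$ be $n - 1$ arbitrary numbers. For any $\alpha > 0$ and $n \ge 2$:

$$\frac{\sum_{i=1}^{n} x_i e^{-\alpha x_i}}{\sum_{i=1}^{n} e^{-\alpha x_i}} \le \frac{n}{\alpha}.$$

The third term in (24) is bounded by the following lemma adapted from Lever et al. (2010). The
proof of this lemma is also provided at the end of this section.

Lemma 11 For $\mu_t^{exp}$ and $\rho_t^{exp}$ defined by (23) and (8) under the conditions of Theorem 4 the
following holds simultaneously with the assertion of Theorem 2:

$$\left|\hat R_t(\rho_t^{exp}) - R(\rho_t^{exp})\right| \le \frac{1}{\varepsilon_t \sqrt{2t}} \left(\frac{\gamma_t}{\varepsilon_t \sqrt{2t}} + \sqrt{3\ln(t+1) - \ln\delta}\right). \tag{27}$$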
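Finally, for the last term in (24):

$$\begin{aligned}
R(\rho_t^{exp}) - R(\tilde\rho_t^{exp}) &= \sum_a \left(\rho_t^{exp}(a) - \tilde\rho_t^{exp}(a)\right) R(a) \\
&\le \frac{1}{2} \left\|\rho_t^{exp} - \tilde\rho_t^{exp}\right\|_1 \qquad (28) \\
&= \frac{1}{2} \sum_a \left|\rho_t^{exp}(a) - (1 - K\varepsilon_{t+1}) \rho_t^{exp}(a) - \varepsilon_{t+1}\right| \\
&= \frac{1}{2} \sum_a \left|K\varepsilon_{t+1} \rho_t^{exp}(a) - \varepsilon_{t+1}\right| \\
&\le \frac{1}{2} \sum_a \left(K\varepsilon_{t+1} \rho_t^{exp}(a) + \varepsilon_{t+1}\right) \\
&= K\varepsilon_{t+1}.
\end{aligned}$$

In (28) we used the fact that $R(a)$ is bounded by 1 and $\rho_t^{exp}$ and $\tilde\rho_t^{exp}$ are probability distributions.
Gathering all the terms and substituting them back into (24) we obtain:

$$R(a^*) - R(\tilde\rho_t^{exp}) \le \frac{1}{\varepsilon_t} \sqrt{\frac{\ln(K) + 3\ln(t+1) - \ln\delta}{2t}} + \frac{K}{\gamma_t} + \frac{1}{\varepsilon_t \sqrt{2t}} \left(\frac{\gamma_t}{\varepsilon_t \sqrt{2t}} + \sqrt{3\ln(t+1) - \ln\delta}\right) + K\varepsilon_{t+1}.$$

By choosing $\gamma_t = K^{1/4} t^{1/4}$ and $\varepsilon_t = K^{-1/4} t^{-1/4}$ we get:

$$R(a^*) - R(\tilde\rho_t^{exp}) \le \frac{K^{3/4}}{(t+1)^{1/4}} \left(\sqrt{\frac{\ln(K) + 3\ln(t+1) - \ln\delta}{2K}} + 1 + \frac{1}{2} + \sqrt{\frac{3\ln(t+1) - \ln\delta}{2K}} + 1\right).$$

By integration over $t$ the total regret is bounded by $\tilde O(K^{3/4} t^{3/4})$, where $\tilde O$ hides logarithmic
factors. ■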

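6.1 Proofs of Technical Lemmas for Section 6

We conclude this section with proofs of the two technical lemmas used in the proof of the regret
bound.

Proof of Lemma 10 Since $x_1 = 0$ we have:

$$\frac{\sum_{i=1}^{n} x_i e^{-\alpha x_i}}{\sum_{i=1}^{n} e^{-\alpha x_i}} = \frac{\sum_{i=2}^{n} x_i e^{-\alpha x_i}}{1 + \sum_{j=2}^{n} e^{-\alpha x_j}} \le \sum_{i=2}^{n} \max\left\{0,\, x_i e^{-\alpha x_i}\right\} \le \frac{n}{\alpha},$$

where the last inequality follows from the fact that $x e^{-\alpha x} \le \frac{1}{\alpha}$. ■

We note that by numerical simulations it seems that a tighter bound $\frac{\sum_{i=1}^{n} x_i e^{-\alpha x_i}}{\sum_{i=1}^{n} e^{-\alpha x_i}} \le \frac{\ln(K)}{\alpha}$ holds,
but we were unable to prove it analytically.

The proof of Lemma 11 is adapted with minor modifications from Lever et al. (2010) and is
based on the following two lemmas, which are also adapted from Lever et al. (2010) and are proved
right after the proof of Lemma 11.

Lemma 12 For $\mu_t^{exp}$ and $\rho_t^{exp}$ defined by (23) and (8):

$$KL(\rho_t^{exp}\|\mu_t^{exp}) \le \gamma_t \left([\hat R_t(\rho_t^{exp}) - R(\rho_t^{exp})] + [R(\mu_t^{exp}) - \hat R_t(\mu_t^{exp})]\right). \tag{29}$$

Lemma 13 For $\mu_t^{exp}$ and $\rho_t^{exp}$ defined by (23) and (8) under the conditions of Theorem 4 the
following holds simultaneously with the assertion of Theorem 2:

$$KL(\rho_t^{exp}\|\mu_t^{exp}) \le \frac{\gamma_t^2}{2t\varepsilon_t^2} + \frac{2\gamma_t}{\varepsilon_t \sqrt{2t}} \sqrt{3\ln(t+1) - \ln\delta}. \tag{30}$$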
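
The inequality of Lemma 10 can be probed numerically on a grid of points. In the sketch below the bound n / alpha is the form read off from the extracted text, and the grid values, the values of alpha, and the function name are assumptions; the tighter bound conjectured in the remark above is not asserted here.

```python
import itertools
import math


def gibbs_weighted_average(xs, alpha):
    """sum_i x_i exp(-alpha x_i) / sum_i exp(-alpha x_i)."""
    numerator = sum(x * math.exp(-alpha * x) for x in xs)
    denominator = sum(math.exp(-alpha * x) for x in xs)
    return numerator / denominator


grid = (-1.0, -0.25, 0.0, 0.25, 0.5, 1.0, 4.0)
worst_ratio = 0.0
for alpha in (0.5, 1.0, 3.0, 10.0):
    for tail in itertools.product(grid, repeat=3):
        xs = (0.0,) + tail  # x_1 = 0 as required by the lemma
        value = gibbs_weighted_average(xs, alpha)
        worst_ratio = max(worst_ratio, value / (len(xs) / alpha))
print(worst_ratio)
```

On this grid the left-hand side stays well below n / alpha, which is consistent with the remark that the bound is not tight.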
5. (30) 
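Proof of Lemma 11 Substitution of (30) into (3) yields (27). ■

Proof of Lemma 12

$$\begin{aligned}
KL(\rho_t^{exp}\|\mu_t^{exp}) &= \sum_a \rho_t^{exp}(a) \ln\frac{e^{\gamma_t \hat R_t(a)} Z(\mu_t^{exp})}{e^{\gamma_t R(a)} Z(\rho_t^{exp})} \\
&= \sum_a \rho_t^{exp}(a)\, \gamma_t \left(\hat R_t(a) - R(a)\right) - \ln\frac{Z(\rho_t^{exp})}{Z(\mu_t^{exp})} \\
&= \gamma_t \left[\hat R_t(\rho_t^{exp}) - R(\rho_t^{exp})\right] - \ln\left(\sum_a \mu_t^{exp}(a) e^{\gamma_t (\hat R_t(a) - R(a))}\right) \qquad (31) \\
&\le \gamma_t \left(\left[\hat R_t(\rho_t^{exp}) - R(\rho_t^{exp})\right] + \left[R(\mu_t^{exp}) - \hat R_t(\mu_t^{exp})\right]\right). \qquad (32)
\end{aligned}$$

In (31) we used the fact that $\frac{1}{Z(\mu_t^{exp})} = \mu_t^{exp}(a) e^{-\gamma_t R(a)}$ (for any $a$) and in (32) we used the concavity
of ln. ■

Proof of Lemma 13 By Theorem 2 and simultaneously with it we have:

$$\left|\hat R_t(\mu_t^{exp}) - R(\mu_t^{exp})\right| \le \frac{1}{\varepsilon_t} \sqrt{\frac{KL(\mu_t^{exp}\|\mu_t^{exp}) + 3\ln(t+1) - \ln\delta}{2t}} = \frac{1}{\varepsilon_t} \sqrt{\frac{3\ln(t+1) - \ln\delta}{2t}},$$

together with the bound (3) for $\rho_t^{exp}$. By substituting this into (29) we have:

$$KL(\rho_t^{exp}\|\mu_t^{exp}) \le \frac{\gamma_t}{\varepsilon_t \sqrt{2t}} \left(\sqrt{KL(\rho_t^{exp}\|\mu_t^{exp}) + 3\ln(t+1) - \ln\delta} + \sqrt{3\ln(t+1) - \ln\delta}\right).$$

If $KL(\rho_t^{exp}\|\mu_t^{exp}) \le \frac{\gamma_t}{\varepsilon_t \sqrt{2t}} \sqrt{3\ln(t+1) - \ln\delta}$ we are done. Otherwise, by rearranging the terms we obtain:

$$\begin{aligned}
\left(KL(\rho_t^{exp}\|\mu_t^{exp})\right)^2 &- 2\, KL(\rho_t^{exp}\|\mu_t^{exp}) \frac{\gamma_t}{\varepsilon_t \sqrt{2t}} \sqrt{3\ln(t+1) - \ln\delta} + \left(\frac{\gamma_t}{\varepsilon_t \sqrt{2t}}\right)^2 \left(3\ln(t+1) - \ln\delta\right) \\
&\le \left(\frac{\gamma_t}{\varepsilon_t \sqrt{2t}}\right)^2 KL(\rho_t^{exp}\|\mu_t^{exp}) + \left(\frac{\gamma_t}{\varepsilon_t \sqrt{2t}}\right)^2 \left(3\ln(t+1) - \ln\delta\right),
\end{aligned}$$

which together with the fact that $KL(\rho_t^{exp}\|\mu_t^{exp}) \ge 0$ implies the result. ■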

7 Discussion

We presented a lemma that makes it possible to bound expectations of convex functions of certain
sequentially dependent variables by expectations of the same functions of i.i.d. Bernoulli variables. We
showed that this lemma can be used to derive an alternative to the Hoeffding-Azuma inequality for
convergence of martingale values.

We presented two different approaches to PAC-Bayesian analysis of martingale-type sequentially
dependent random variables, which was an important challenge for PAC-Bayesian analysis for a long
time. Our contribution opens the possibility to apply PAC-Bayesian analysis in multiple domains
where sequentially dependent variables are encountered. For example, Theorems 2 and 3 can be
used to bound convergence of an uncountable number of parallel martingale sequences, where a simple
union bound does not apply.

We answered positively the important open question of whether PAC-Bayesian analysis can be
applied under limited feedback and used to study the exploration-exploitation trade-off. Although
our regret bound for the multiarmed bandit problem is still far from the state of the art, we believe that
this gap can be closed in future work.

Multiarmed bandits are just the first tier in a whole hierarchy of reinforcement learning problems
with increasing structural complexity, including continuum-armed bandits, contextual bandits, and
reinforcement learning in discrete and continuous spaces. In many of these domains Bayesian
approaches and incorporation of prior knowledge have already proved beneficial in practice, but their
rigorous analysis remains difficult to carry out. We believe that the PAC-Bayesian approach will prove
to be as useful for this purpose as it has already proved to be in the domain of supervised learning.